Identification of long non-coding RNAs and RNA binding proteins in breast cancer subtypes

Breast cancer is a heterogeneous disease classified into four main subtypes with different clinical outcomes, such as patient survival, prognosis, and relapse. Current genetic tests for the differential diagnosis of BC subtypes showed a poor reproducibility. Therefore, an early and correct diagnosis of molecular subtypes is one of the challenges in the clinic. In the present study, we identified differentially expressed genes, long non-coding RNAs and RNA binding proteins for each BC subtype from a public dataset applying bioinformatics algorithms. In addition, we investigated their interactions and we proposed interacting biomarkers as potential signature specific for each BC subtype. We found a network of only 2 RBPs (RBM20 and PCDH20) and 2 genes (HOXB3 and RASSF7) for luminal A, a network of 21 RBPs and 53 genes for luminal B, a HER2-specific network of 14 RBPs and 30 genes, and a network of 54 RBPs and 302 genes for basal BC. We validated the signature considering their expression levels on an independent dataset evaluating their ability to classify the different molecular subtypes with a machine learning approach. Overall, we achieved good performances of classification with an accuracy >0.80. In addition, we found some interesting novel prognostic biomarkers such as RASSF7 for luminal A, DCTPP1 for luminal B, DHRS11, KLC3, NAGS, and TMEM98 for HER2, and ABHD14A and ADSSL1 for basal. The findings could provide preliminary evidence to identify putative new prognostic biomarkers and therapeutic targets for individual breast cancer subtypes.

Breast cancer (BC) is one of the most common cancers around the world and was estimated the most frequent cancer among women (25% of all new cancers recorded) 1 . The heterogeneity of BC reduces the specificity of biological features (e.g., histological grade and hormone receptor status) which are usually utilized for the diagnosis and prognosis of BC and to address a therapy 2,3 . The classification of biological BC subtypes is based on the use of techniques such as immunohistochemistry and gene expression profiling 4 .
In 2011 The St. Gallen International Breast Cancer Conference reported a molecular subtype approach to guide the therapy of BC based on immunohistochemical markers: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) 4 . In addition to the detection of these standard biomarkers, St. Gallen in 2013 included the evaluation of a marker of cell proliferation: Ki-67 5  Differential expression analysis. Quantile-adjusted conditional maximum likelihood was used as statistical test to detect differentially expressed RNAs between normal breast tissue and BC 32 . This analysis was performed considering each BC subtype at time compared to normal breast tissue. Those genes with a |log2(fold change)|≥ 1 and adjusted p-value < 0.05 were considered to have statistical significance. The p-values were adjusted using the Benjamini-Hochberg procedure for multiple testing correction 34 . We further identified significantly dysregulated RBPs based on RBP catalog of Hentze et al. 27 . To identify lncRNAs present in the RNA-seq data we consulted Ensembl Biomart 100 35 and LNCipedia database version 5.2 36 .
Venn Diagram among RNAs in the four BC subtypes was represented using the R-package VennDiagram 1.6.20 37 .
The functional role of differentially expressed lncRNAs was investigated using Cancer lncRNA Census 2.0 38 .

Protein-RNA interaction predictions.
Protein-RNA networks were retrieved from the RNAct database 39 containing predicted and experimentally validated 40,41 protein-RNA interactions (which is based on UniProt release 2017_10 and GENCODE release 27). Within RNAct, the binding abilities were predicted using the catRAPID algorithm  . A z-score is calculated based on the values for experimentally determined protein-RNA interactions from eCLIP data from the ENCODE project and provides a score for the interaction of a protein-RNA pair of interest (for more details 39 ).
Survival analysis. Survival analysis was performed in October 2021 from The Human Protein Atlas website 45 . Kaplan-Meier analysis investigated the prognosis of BC patients and the differences between the survival curves were explored with the log-rank tests 46 . We considered a gene/RBP to be prognostic if p-value < 0.05.

Machine learning approach.
We identified subtype-specific networks and we validated with a machine learning approach their ability to classify the four BC molecular subtypes. The performances of RNA interactions for each BC subtype were evaluated with a linear support vector machine (SVM) and random forest classifiers, using the R-package caret 6.0.86 47 . We used default parameters for linear SVM and random forest classifiers. RNA expression levels were normalized to make their scale comparable with the caret function "preprocess". We used an independent GEO dataset (GSE58212) that includes: 121 luminal A, 69 luminal B, 32 HER2, and 36 basal samples.

Results
Differentially expressed long non-coding RNAs. By comparing the four breast cancer subtypes to their respective normal tissues, we found 3199 differentially expressed genes (DEGs) from the comparison "Luminal A vs. NS", 4074 from "luminal B vs. NS", 4134 from "HER2 vs. NS", and 4181 from "basal vs. NS". Sup- Figure 1. Workflow of the computational approach. The computational method was applied considering the comparison of each breast cancer (BC) subtype vs normal breast tissue. Differential expressional analysis with quantile-adjusted conditional maximum likelihood was performed on The Cancer Genome Atlas (TCGA) data between each breast cancer subtype and normal breast tissue. The analysis identified differentially expressed genes (DEGs), differentially long non-coding RNAs (DE lncRNAs) and differentially RNA-binding proteins (DE RBPs   Table 1 shows differentially expressed long non-coding for each BC subtype. 1 lncRNA was found only in luminal A (TSIX), 1 lncRNA was found only in luminal B (PVT1), 2 lncRNAs were found only in HER2 (SNHG8 and GAS5), and 3 lncRNAs were found only in basal (HCP5, SNHG3 and MIR155HG). 3 lncRNAs were found in common among 4 BC subtypes (HOTAIR, EMX2OS and MYCNOS).  Table 2 shows the functional information of lncRNAs extracted by Cancer lncRNA Census 2.0. We found that 14 of 15 lncRNAs were previously associated with different cancer types. GAS5, PVT1, and XIST showed a dual role as tumor suppressors and oncogenes. DLEU2 and EGOT play a role in cancer as tumor suppressors. HCG11, HCP5, HOTAIR, MYCNOS-01, PART1, SNHG3, SNHG5, SNHG8, and UCA1 were oncogenes in different caner types. To date, MIR155HG is not associated with a clear function with cancer. and identified significantly dysregulated RBPs based on the differential expression analysis as described above. Among the 3199 differentially expressed genes in luminal A, there were 99 out of 1393 RBPs, among the 4074 DEGs in luminal B there were 176 RBPs. 155 RBPs out of 4134 DEGs were found in HER2 and among the 4181 DEGs there were 179 RBPs. Table 3 summarizes the results of differential expression analyses considering long non-coding RNAs and RNA-binding proteins compared to normal tissues.
4 RBPs (NUDT16L1, RBM20, PCDH20, and PCBP3) were found only in luminal A, 29 RBPs were found only in luminal B, 23 RBPs were found only in HER2, and 62 RBPs were found only in basal (Fig. 2B). Supplementary file 2 shows the list of RBPs for each subtype.
Interactions of RNA-binding proteins. Overall, from the differential expression analysis we found 5980 unique genes for all subtypes, 19 unique lncRNAs, and 281 unique RBPs.
We obtained interaction predictions for the 5980 RNAs and 281 RBPs from the RNAct protein-RNA interaction database. Interaction predictions are prioritized by a normalized score (z-score). The distribution of z-scores in our data is shown in the Supplementary Fig. 3. We selected the protein-RNA interactions that obtained a z-score ≥ 2 since this should provide a good balance between sensitivity and specificity. We obtained a network of 2585 nodes (241 RBPs, 7 lncRNAs, and 2337 RNAs), altered in at least one BC subtype, with 45,727 interactions. Supplementary file 4 shows these interactions.
We selected direct interactions involving only differentially expressed RBPs and genes present in a single subtype, namely, we considered the subtype-specific interactions. For luminal A we found a network consisted of 2 RBPs (RBM20 and PCDH20) and 2 genes (HOXB3 and RASSF7). The specific network for luminal B includes 21 RBPs and 53 DEGs. The specific network for HER2 includes 14 RBPs and 30 DEGs. The specific network for basal includes 54 RBPs and 302 DEGs. However, we did not find lncRNAs specific for a BC subtype interacting with the RBPs and DEGs specific for the same BC subtype.
From the subtype-specific interactions we evaluated the number of RNA targets for each RBP and the number of RBP for each RNA (Fig. 3).  www.nature.com/scientificreports/ We found that dysregulated RNAs in luminal A have a lower number of differentially expressed RBPs than dysregulated RNAs in luminal B, HER2 and basal BC (p-value < 0.001). We can explain these results indicating that the abnormal expression of RBPs increase with the worsening of the prognosis. In addition, the number of altered RNAs regulated by each altered RBP is directly proportional to the aggressiveness of the BC subtype, suggesting RBP as potential biomarkers responsible of BC subtypes differentiation.
We analyzed the prognostic role of subtype-specific RBPs and genes (Table 4). We found that 43% of specific RBPs for luminal B, 57% of specific RBPs for HER2, and 46% of specific RBPs for basal can differentiate BC patients with good and poor prognosis. 50% of specific genes for luminal A, 51% of specific genes for luminal B, 40% of specific genes for HER2, and 40% of specific genes for basal may influence BC patient survival.
We found a lower p-value (p-value = 0.00001) for DCTPP1 and NFKBIE.
Validation of protein-RNA interactions for each BC subtype. We validated subtype-specific interactions on an independent dataset from the NCBI Gene Expression Omnibus (GEO), GSE58212. We used the expression levels of interacting biomarkers for the classification, namely the expression levels of 2 RBPs (RBM20 and PCDH20) and 2 genes (HOXB3 and RASSF7) that we obtained for luminal A, the 21 RBPs and 53 DEGs obtained for luminal B, the 14 RBPs and 30 DEGs for HER2, and 54 RBPs and 302 DEGs obtained for basal. Using BC subtype-specific interactions we trained (75% of the original dataset) and tested (25% of the original dataset) a classifier with a linear support vector machine (SVM) and random forest models obtaining good performances (Table 5). We achieved the best accuracy from the comparison HER2 vs other subtypes (accuracy = 0.96) and basal vs other subtypes (accuracy = 0.96) using SVM classifier. Good performances were achieved with both classifiers for luminal A vs other subtypes and basal vs other subtypes. Low sensitivity was obtained in luminal B vs other subtypes and HER2 vs other subtypes with random forest classification.
TSIX mediates the X chromosome inactivation acting as a XIST repressor. Indeed, a strong inverse correlation between XIST and TSIX was demonstrated: a down-regulation of TSIX leads to an up-regulation of XIST that causes the inactivation of the X chromosome 48 . A previous study demonstrated that TSIX together with other lncRNAs such as OIP5-AS1, TUG1, NEAT1, MALAT1, and XIST were able to synergistically regulate cancer genes and pathways across different cancer types 49 . In addition, TSIX was differentially expressed in lung cancer 50 .
A previous study reported that PVT1 was up-regulated in BC tissues when compared with the normal tissues and its silencing repressed tumor growth 51 . PVT1 is associated with other types of cancers such as lung and   ovarian cancer and is correlated with the survival of patients 52 . Furthermore, PVT1 regulates several cancers processes and pathways such ad cell-cell adhesion and TGF-β signaling pathway. A previous study reported that PVT1 was co-expressed with another gene in the TGF-β signaling pathway. Indeed, PVT1 can regulate the protein stability of the MYC oncogenic protein 53 . In our study, 2 HER2-specific lncRNAs (SNHG8 and GAS5) were found. SNHG8 was overexpressed in different types of cancer, suggesting its role in the progression of these tumors. The silencing of SNHG8 inhibits the proliferation and invasion of BC cells, MCF-7 and ZR-75-30 54 . A recent study showed a possible molecular mechanism of action of SNHG8: it is a sponge for miR-656-3p and modulates SATB1 expression. SATB1 is associated with cancer cell proliferation, migration, and invasion 55 .
GAS5 is a strong candidate as prognostic biomarker since it was identified significantly downregulated in BC and correlated with poor prognosis. In addition, GAS5 is also a potential drug target since it is implicated in resistance to multiple drugs in BC such as tamoxifen and lapatinib 56 .
3 basal-specific lncRNAs (HCP5, SNHG3 and MIR155HG) were found. HCP5 was positively associated with the expression of immune checkpoints since it is mainly found expressed in immune system cells. In addition, www.nature.com/scientificreports/ HCP5 promotes tumor growth in vivo and in vitro as well as apoptosis and proliferation 57 . A recent study suggested that HCP5 could be a promising drug target in triple negative BC 58 .
The oncogene SNHG3 was up-regulated in BC cells and is associated with the growth of cell proliferation regulating tRNA processing and signal transduction. A recent study suggested that the down-regulation of SNHG3 might act as a possible therapeutic strategy for BC. In addition, it was demonstrated an inverse correlation between SNGH3 and miR-154-5p: increase of SNHG3 inhibits miR-154-5p and upregulates BC cell proliferation 59 .
Few is known about the role of MIR155HG in BC. MIR155HG is a precursor of miR-155-5p and was identified as a direct target of FOX3 40 . A recent study proposed the study of the expression MIR155HG together with FANCI and C-MYC as potential diagnostic test and drug targets in gynaecological malignancies 60 .
Then, in our study we focused on subtype-specific RBPs. 4 RBPs were found only in luminal A, 29 RBPs were found only in luminal B, 23 RBPs were found only in HER2, and 62 RBPs were found only in basal (Fig. 2B).
We found NUDT16L1, RBM20, PCDH20, and PCBP3 as luminal A-specific RBPs. NUDT16L1, also called SDOS and TIRR, is a novel RBP that regulates several transcripts encoding for centrosomal proteins and has a key role controlling cilia formation. Cilia are organelles present on eukaryotic cells that plays a role in cell progression. However, for a long time NUDT16L1 has been little studied and novel uncharacterized associations with cancer must be studied 61 .
RBM20 plays a role in the familial cardiomyopathy acting on titin and tropomyosin, two proteins involved in the biomechanics of the striated muscle. It is also associated with fasting glucose regulating insulin damage in cardiac tissues. However, its role in cancer has not yet been demonstrated 62 .
PCDH20, member of subfamily of the cadherin family, is down-regulated in non-small cell lung cancer 63 , nasopharyngeal carcinoma 64 , and hepatocellular carcinoma 65 . The prognostic role of PCDH20 was reported in a recent study: patients with high PCDH20 expression showed a better overall survival than those with low PCDH20 expression in hepatocellular carcinoma 66 . The tumor-suppressor gene PCDH20 through the Wnt/βcatenin signaling pathway acts inhibiting cell proliferation and cell migration 67 .
PCBP3 was associated with favorable prognosis in pancreatic cancer. However, no previous study investigated the molecular mechanism of PCBP3 in carcinogenesis 68 .
Among 29 luminal B-specific RBPs we focused on GSTP1 and RRS1 because previous studies reported an interesting association with BC.
GSTP1 is involved in the drug resistance of tumor cells, including BC. A recent study showed that high expression of GSTP1 can activate the NF-κB signaling pathway in tumor associated macrophages (TAMs) and regulates the expression of IL-6 69 .
RRS1 is a crucial nuclear protein implicated in ribosome biogenesis. It was overexpressed in several human cancers including BC. In addition, elevated RRS1 expression levels were correlated with lymph node metastasis and unfavorable clinical outcome. Some evidence also provided new molecular mechanisms of RRS1 in the proliferation of BC through RPL11/MDM2/p53 pathway 70 .
Among 23 HER2-specific RBPs we identified ALDH18A1 and LASP1 as potential BC biomarkers as they were associated with BC in previous studies. ALDH18A1 is an enzyme implicated in the conversion of glutamine to proline through glutamate. Its over-expression increases proline levels and decreases cell survival in BC as well as reduces reactive oxygen species. ALDH18A1 is also associated with an oncogene, MYC able to regulate cell metabolism and key genes implicated in cancer 71,72 .
LASP1 is a well-known protein that interacts with many proteins regulating tumor cell migration and invasion. A previous study showed that LASP1 binds to Ago2 that plays a key role in BC cell motility in response to CXCR4 activity 73 .
Among 62 basal-specific RBPs we found as interesting biomarkers: SERPINH1 and DKC1. SERPINH1 plays a key role for the correct folding and secretion of different types of collagen and has previously been associated to cancer progression. Its overexpression is correlated with angiogenesis, migration, and invasion 74 . A previous study demonstrated that SERPINH1 is regulated by miR-148a-5p, a miRNA predictive of unfavorable prognosis 75 .
DKC1 is associated with a poor prognosis in BC. Indeed, patients with a higher expression of DKC1 metastasize more frequently to lymph node than patients with lower DKC1 expression levels. The role of DKC1 in cancer prognosis could be explained by the role of DKC1 in regulating mRNA translation 76 .
Furthermore, in the present study we selected direct interactions involving subtype-specific differentially expressed RBPs and DEGs. Although previous studies demonstrated that numerous lncRNAs are deregulated in different cancer types and RBPs could play a role to the deregulation of lncRNAs 31,77 . in our study we did not find interactions involving differentially expressed lncRNAs and RBPs specific for each subtype. Indeed, the Table 5. Performance of classification Support Vector Machine (SVM) and Random Forest (RF) using subtype-specific interactions (sensitivity, specificity and accuracy). www.nature.com/scientificreports/ final signature for each subtype is composed of only subtype-specific interacting genes and RBPs. This result can derive by some limitations of our study, such as: (i) lncRNA profiles based on TCGA data, which contain cancer cells and stromal cells could influence the results obtained by differential expression analysis, (ii) the low characterization of lncRNAs in the TCGA data. The characterization of lncRNAs could be more accurate with the new knowledge of data in the future. However, in our study we found interesting networks consisted of subtype-specific interacting genes and RBPs: a network of only 2 RBPs (RBM20 and PCDH20) and 2 genes (HOXB3 and RASSF7) for luminal A, a network of 21 RBPs and 53 DEGs for luminal B, a HER2-specific network of 14 RBPs and 30 DEGs, and a network of 54 RBPs and 302 DEGs for basal BC. From these networks we investigated the number of RNA targets for each RBP and the number of RBP for each RNA. We found that the number of RBPs per RNA and the number of RNAs per RBP increases with the aggressiveness of the BC molecular subtype. This finding could indicate the key role of the interactions between differentially expressed RBPs and DEGs in the progression of BC. Indeed, luminal A, the less aggressive BC subtype showed a lower number of RNA targets for each RBP and of RBP targets for each RNA. To our knowledge this is the first study that obtained a similar association. Encouraged by the results obtained that demonstrated the specific RBP-RNA interactions for each subtype we validated subtype-specific networks using a machine learning approach on an independent BC dataset from GEO. Overall, we obtained good performances of classification with an accuracy > 0.80 (Table 5). We achieved the best performances from the classification HER2 vs other subtypes (accuracy = 0.96) and basal vs other subtypes (accuracy = 0.96). Overall, given the good results of the classifier we propose the study of these BC subtype-specific interacting biomarkers as potential candidates for differential diagnosis of BC.
Among biomarkers, we found novel RBPs and genes that the survival analysis showed to have a prognostic role.
The low expression of RASSF7, a specific gene of luminal A, plays a prognostic role in BC as it is associated with a poor prognosis. To date, there is not a clear association of RASSF7 with BC.
The survival analysis in this study found that high expression of a specific gene of luminal B, DCTPP1, in a group of 609 BC patients is associated with a poor prognosis respect to 466 BC patients with a low expression. Although previous studies showed its role in DNA damage and genetic instability further studies are needed to investigate its potential therapeutic in BC 78 .
We found that DHRS11, KLC3, NAGS, and TMEM98, specific genes for HER2, are associated with a poor prognosis in BC patients. DHRS11 is implicated in the pathway of cytochrome P450, KLC3 in RHO GTPases activate KTN1M, NAGS in the urea cycle, and TMEM98 in oligodendrocyte differentiation 79 . However, to date there is not a clear association of these genes with BC.
We obtained ABHD14A and ADSSL1 as potential candidate prognostic biomarkers for basal BC. ABHD14A is associated with metabolic disorders of biological oxidation enzymes, and ADSSL1 with Purine ribonucleoside monophosphate biosynthesis 80 .
Although we propose several potential prognostic biomarkers for BC subtypes our study presents some limits. We selected in silico the biomarkers and validated them with a machine learning approach using an independent GEO dataset, and survival analysis. Molecular validation of these biomarkers will be performed in the near future as further studies are needed for translating them to clinical practice.

Conclusions
In this study, we firstly examined BC subtype-specific DEGs, differentially expressed LNCs and RBPs using BC-TCGA dataset. Then, we investigated the regulatory interactions between RBPs and their target genes in BC subtypes. We found different networks specific for each BC subtype: a network of 2 RBPs (RBM20 and PCDH20) and 2 genes (HOXB3 and RASSF7) for luminal A, a network of 21 RBPs and 53 DEGs for luminal B, a HER2-specific network of 14 RBPs and 30 DEGs, and a network of 54 RBPs and 302 DEGs for basal BC. Overall, the analysis sheds light on the role of RBPs in regulating different BC subtypes and we provided a data exploration analysis to aid future experimental studies. In addition, the analyses in this study suggested some novel prognostic BC biomarkers: RASSF7 for luminal A, DCTPP1 for luminal B, DHRS11, KLC3, NAGS, and TMEM98 for HER2, and ABHD14A and ADSSL1 for basal.